NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Pretraining and the lasso

https://doi.org/10.1093/jrsssb/qkaf050

Craig, Erin; Pilanci, Mert; Le_Menestrel, Thomas; Narasimhan, Balasubramanian; Rivas, Manuel_A; Gullaksen, Stein-Erik; Dehghannasiri, Roozbeh; Salzman, Julia; Taylor, Jonathan; Tibshirani, Robert (August 2025, Journal of the Royal Statistical Society Series B: Statistical Methodology)

Abstract Pre-training is a powerful paradigm in machine learning to pass information across models. For example, suppose one has a modest-sized dataset of images of cats and dogs and plans to fit a deep neural network to classify them. With pre-training, we start with a neural network trained on a large corpus of images of not just cats and dogs but hundreds of classes. We fix all network weights except the top layer(s) and fine tune on our dataset. This often results in dramatically better performance than training solely on our dataset. Here, we ask: ‘Can pre-training help the lasso?’. We propose a framework where the lasso is fit on a large dataset and then fine-tuned on a smaller dataset. The latter can be a subset of the original, or have a different but related outcome. This framework has a wide variety of applications, including stratified and multi-response models. In the stratified model setting, lasso pre-training first estimates coefficients common to all groups, then estimates group-specific coefficients during fine-tuning. Under appropriate assumptions, support recovery of the common coefficients is superior to the usual lasso trained on individual groups. This separate identification of common and individual coefficients also aids scientific understanding.
more » « less
Cooperative learning for multiview analysis

https://doi.org/10.1073/pnas.2202113119

Ding, Daisy Yi; Li, Shuangning; Narasimhan, Balasubramanian; Tibshirani, Robert (September 2022, Proceedings of the National Academy of Sciences)

We propose a method for supervised learning with multiple sets of features (“views”). The multiview problem is especially important in biology and medicine, where “-omics” data, such as genomics, proteomics, and radiomics, are measured on a common set of samples. “Cooperative learning” combines the usual squared-error loss of predictions with an “agreement” penalty to encourage the predictions from different data views to agree. By varying the weight of the agreement penalty, we get a continuum of solutions that include the well-known early and late fusion approaches. Cooperative learning chooses the degree of agreement (or fusion) in an adaptive manner, using a validation set or cross-validation to estimate test set prediction error. One version of our fitting procedure is modular, where one can choose different fitting mechanisms (e.g., lasso, random forests, boosting, or neural networks) appropriate for different data views. In the setting of cooperative regularized linear regression, the method combines the lasso penalty with the agreement penalty, yielding feature sparsity. The method can be especially powerful when the different data views share some underlying relationship in their signals that can be exploited to boost the signals. We show that cooperative learning achieves higher predictive accuracy on simulated data and real multiomics examples of labor-onset prediction. By leveraging aligned signals and allowing flexible fitting mechanisms for different modalities, cooperative learning offers a powerful approach to multiomics data fusion.
more » « less
Full Text Available
Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the United States

https://doi.org/10.1073/pnas.2113561119

Cramer, Estee Y.; Ray, Evan L.; Lopez, Velma K.; Bracher, Johannes; Brennen, Andrea; Castro Rivadeneira, Alvaro J.; Gerding, Aaron; Gneiting, Tilmann; House, Katie H.; Huang, Yuxin; et al (April 2022, Proceedings of the National Academy of Sciences)

Short-term probabilistic forecasts of the trajectory of the COVID-19 pandemic in the United States have served as a visible and important communication channel between the scientific modeling community and both the general public and decision-makers. Forecasting models provide specific, quantitative, and evaluable predictions that inform short-term decisions such as healthcare staffing needs, school closures, and allocation of medical supplies. Starting in April 2020, the US COVID-19 Forecast Hub ( https://covid19forecasthub.org/ ) collected, disseminated, and synthesized tens of millions of specific predictions from more than 90 different academic, industry, and independent research groups. A multimodel ensemble forecast that combined predictions from dozens of groups every week provided the most consistently accurate probabilistic forecasts of incident deaths due to COVID-19 at the state and national level from April 2020 through October 2021. The performance of 27 individual models that submitted complete forecasts of COVID-19 deaths consistently throughout this year showed high variability in forecast skill across time, geospatial units, and forecast horizons. Two-thirds of the models evaluated showed better accuracy than a naïve baseline model. Forecast accuracy degraded as models made predictions further into the future, with probabilistic error at a 20-wk horizon three to five times larger than when predicting at a 1-wk horizon. This project underscores the role that collaboration and active coordination between governmental public-health agencies, academic modeling teams, and industry partners can play in developing modern modeling capabilities to support local, state, and federal response to outbreaks.
more » « less
Full Text Available
The United States COVID-19 Forecast Hub dataset

https://doi.org/10.1038/s41597-022-01517-w

Cramer, Estee Y.; Huang, Yuxin; Wang, Yijin; Ray, Evan L.; Cornell, Matthew; Bracher, Johannes; Brennen, Andrea; Rivadeneira, Alvaro J.; Gerding, Aaron; House, Katie; et al (December 2022, Scientific Data)

Abstract Academic researchers, government agencies, industry groups, and individuals have produced forecasts at an unprecedented scale during the COVID-19 pandemic. To leverage these forecasts, the United States Centers for Disease Control and Prevention (CDC) partnered with an academic research lab at the University of Massachusetts Amherst to create the US COVID-19 Forecast Hub. Launched in April 2020, the Forecast Hub is a dataset with point and probabilistic forecasts of incident cases, incident hospitalizations, incident deaths, and cumulative deaths due to COVID-19 at county, state, and national, levels in the United States. Included forecasts represent a variety of modeling approaches, data sources, and assumptions regarding the spread of COVID-19. The goal of this dataset is to establish a standardized and comparable set of short-term forecasts from modeling teams. These data can be used to develop ensemble models, communicate forecasts to the public, create visualizations, compare models, and inform policies regarding COVID-19 mitigation. These open-source data are available via download from GitHub, through an online API, and through R packages.
more » « less
Full Text Available

Search for: All records